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We develop a feature allocation model for inference on genetic 
tumor variation using next-generation sequencing data. Specifically, 
we record single nucleotide variants (SNVs) based on short reads 
mapped to human reference genome and characterize tumor hetero¬ 
geneity by latent haplotypes defined as a scaffold of SNVs on the 
same homologous genome. For multiple samples from a single tumor, 
assuming that each sample is composed of some sample-specific pro¬ 
portions of these haplotypes, we then fit the observed variant allele 
fractions of SNVs for each sample and estimate the proportions of 
haplotypes. Varying proportions of haplotypes across samples is evi¬ 
dence of tumor heterogeneity since it implies varying composition of 
cell subpopulations. Taking a Bayesian perspective, we proceed with 
a prior probability model for all relevant unknown quantities, includ¬ 
ing, in particular, a prior probability model on the binary indicators 
that characterize the latent haplotypes. Such prior models are known 
as feature allocation models. Specifically, we define a simplified ver¬ 
sion of the Indian buffet process, one of the most traditional feature 
allocation models. The proposed model allows overlapping clustering 
of SNVs in defining latent haplotypes, which reflects the evolutionary 
process of subclonal expansion in tumor samples. 

1. Introduction. We propose a feature allocation model [Broderick, Jor¬ 
dan and Pitman (2013, 2013)] to describe tumor heterogeneity using next- 
generation sequencing (NGS) data. We use a variation of the Indian buffet 
process (IBP) [Griffiths and Ghahramani (2005), Teh, Goriir and Ghahra- 
mani (2007)]. The feature allocation in our model is latent. That is, the 
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features are not directly observed. We record point mutations as single nu¬ 
cleotide variants (SNVs), each of which is defined as a DNA locus that pos¬ 
sesses a variant sequence from that on the reference human genome. We use 
the feature allocation model to describe unobserved haplotypes, defined as 
a collection of single nucleotide variants (SNVs) scaffolded on a homologous 
genome. In a tumor sample, having more than two haplotypes is evidence of 
heterogeneous cell subpopulations with a distinct genome. This is the case 
because humans are diploid and we would therefore only observe up to two 
haplotypes if all cells in a tumor sample were genetically homogeneous. In 
the proposed application of feature allocation models to inference for tumor 
heterogeneity, the haplotypes are the features and the SNVs are the exper¬ 
imental units that select the features. The number of features is unknown. 
Each tumor sample is composed of an unknown proportion of each of these 
haplotypes. The top level sampling model for the observed SNV counts is 
then defined as binomial sampling with a proportion for each SNV that is 
implied by this composition. In summary, we solve a deconvolution problem 
to explain the observed SNV frequencies for each sample by compositions of 
latent haplotypes. 

Heterogeneity in cancer tissue has been hypothesized over the past few 
decades [Wersto et al. (1991)] and has been demonstrated elegantly using 
NGS technology over the past few years [Gerlinger et al. (2012)]. Genetic 
variation in a tumor occurs due to evolutionary processes that drive tu¬ 
mor progression. Specifically, tumors include distinct clonal subpopulations 
of cells that arise stochastically by a sequence of randomly acquired mu¬ 
tations. Substantial genetic heterogeneity between tumors (inter-tumor) or 
within a tumor (intra-tumor) can be explained by differences in clonal sub¬ 
populations and varying proportions of those subpopulations [Marusyk and 
Polyak (2010), Russnes et al. (2011), Landau et al. (2013)]. For example, 
Navin et al. (2010) reported clonal genomic heterogeneity in breast cancers. 

Data derived from NGS experiments include SNVs, small indels and copy 
number variations [Wheeler et al. (2008), Ng and Kirkness (2010)]. Many 
researchers use SNV data to investigate genes and genomic regions related 
to cancer phenotypes [Erichsen and Chanock (2004), Engle, Simpson and 
Landers (2006)]. In this paper, we utilize whole-genome sequencing data 
measuring variant allele fractions (VAFs) at SNVs to understand tumor 
heterogeneity by proposing inference on how haplotypes may be distributed 
within a tumor. 

In an NGS experiment, millions of short DNA reads are generated and are 
then aligned to the reference genome. At certain positions of the genome, 
some or all of the mapped reads will show a sequence different from what is 
in the reference genome. At each genomic locus, the proportion of short reads 
bearing a variant sequence is called the VAF. If the VAF at a locus is nonzero, 
an SNV may be “called” at that locus, based on statistical inference [Li et al. 
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(2009)]. The raw experimental data comprises the total number of reads {N) 
that are mapped to the locus and the number of those reads (n) indicating a 
variation from the reference sequence. Then VAF = n/N. If a tumor sample 
is homogeneous, that is, having a single clone, the VAF values of all the SNVs 
should be close to 0, 0.5 or 1, reflecting the three possible homozygous and 
heterozygous alleles (i.e., AA, AB, BB) at any SNV. Different VAF values 
imply heterogeneity of the cellular genome in the tumor sample (see Figure 1 
for an example). We propose to study inference to deconvolute the VAFs 
from multiple SNVs and back out the latent haplotypes. 

We propose a Bayesian feature allocation model to characterize such cel¬ 
lular heterogeneity in a way that explains the observed NGS data. We con¬ 
struct a matrix of binary features (equivalently, haplotypes) as shown in 
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(a) Multiple haplotypes as evidence of a heterogeneous tumor 
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(b) Hypothetical short reads data 

Fig. 1. A hypothetical example explaining how NGS data can he used to infer heteroge¬ 
neous tumor samples, (a) shows that there are two subclones (cell subpopulations) in the 
tumor sample with different haplotypes consisting of two SNVs: For subclone 1, there are 
two haplotypes, AT and GT. For subclone 2, there is only one haplotype, AG. Thus, there 
are a total of three haplotypes in the tumor sample, implying heterogeneous cell populations 
since a population of homogeneous cells would only support up to two haplotypes. Here se¬ 
quence G for SNV 1 and sequence G for SNV 2 are mutations, (b) shows hypothetical short 
reads for this sample if it is sequenced, assuming that the proportions of the two subclones 
are equal. The short reads counts are summarized as observed VAFs, which are used for 
our statistical inference. 
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Haplotype 



Eig. 2. An illustration of cell types (binary latent features) in columns. A hlack/white 
block indicates a variant/reference DNA sequence at the corresponding SNV (row) for the 
haplotype (column). 

Figure 2. In the figure, columns correspond to haplotypes and rows corre¬ 
spond to SNVs. We define haplotype c by a binary vector {zic ,... ,^s'c) of 
indicators of whether (zgc = 1) or not {zgc = 0) a variant sequence is observed 
at the SNV s. Here we view SNV as a genetic locus on which either a vari¬ 
ant or reference DNA sequence could be observed. Figure 2 illustrates the 
definition of five haplotypes (C = 4, columns) with 5 = 10 SNVs (rows). In 
the figure, black (white) indicates = 1 (^sc = 0). For example, SNV 1 in 
Figure 2 possesses a variant sequence in the two haplotypes c = 0 and c = 1. 
On the other hand, SNV 9 possesses variant sequences in four haplotypes: 
c = 0,1,2 and 4. A prior probability model on such a binary matrix Z = [z^c] 
is known as a feature allocation model. Here, we assume that C is unknown 
and place a prior on C. 

Assuming that samples are composed of different proportions of C hap¬ 
lotypes, we aim to fit the observed VAFs of the SNVs to infer these propor¬ 
tions. For example, we may observe that one sample is dominated by hap¬ 
lotypes 1 and 4, while another is dominated by haplotypes 2 and 3. If the 
samples are from the same tumor, the differences in haplotypic compositions 
are evidence of intra-tumor heterogeneity; on the other hand, differences in 
samples from different tumors imply inter-tumor heterogeneity. Therefore, 
the proposed inference provides a unified framework to address inference for 
both biological concepts. Importantly, the characterization of haplotypes is 
based on selected SNVs only. Otherwise inference for heterogeneity between 
tumors in different patients would not be biologically meaningful, as cellular 
genomes and haplotypes are not expected to be shared across patients. How¬ 
ever, for tumors in the same class of disease, SNVs in local disease-related 
genomic regions may be common to all or some of the tumors, thereby al¬ 
lowing for the proposed inference. 
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There are currently few approaches that address the problem of tumor 
heterogeneity. Su et al. (2012) and Larson and Fridley (2013) recognized that 
a tumor sample is a mixture of normal cells and tumor cells, and developed 
a method to estimate tumor purity levels for paired tumor-normal tissue 
samples using DNA sequencing data. None of the two methods considers 
more than two samples or unpaired samples. PurBayes [Larson and Fridley 
(2013)] accounts for intra-tumor heterogeneity, but it does not provide infer¬ 
ence on the subpopulation configurations as inference on the latent matrix 
Z under the proposed model. PyClone [Roth et al. (2014)], a recently pub¬ 
lished method, proposes inference to cluster SNVs with different VAFs. An 
underlying assumption of PyClone is that SNVs can be arranged in clusters 
that inform about subclones. A key component of PyClone is the use of 
clustering models such as the Dirichlet process for inference on these clus¬ 
ters. While such clusters are informative about heterogeneity, inference that 
is provided by PyClone is not meant to identify subclones or haplotypes. 
The primary aim of PyClone is inference on mutation clusters, defined as a 
group of SNVs with similar variant allele fractions. 

In contrast, our proposed feature allocation model explicitly models the 
haplotypic genomes of subclones, allowing overlapping SNVs shared between 
different subclones. We do not use nonoverlapping SNV clusters as the build¬ 
ing block for subclones. That is, instead of first estimating the SNV clusters 
and then constructing subclones based on clusters, we directly infer the 
subclonal structure based on haplotypes. We show in later examples the 
distinction between PyClone and our proposed method. 

The remainder of the paper is organized as follows: Section 2 describes the 
proposed Bayesian feature allocation model and a model selection criterion 
to select the number of subclones. Section 3 describes simulation studies. 
Sections 4 and 5 report data analyses of in-house data sets to illustrate 
inter-tumor heterogeneity and intra-tumor heterogeneity, respectively. The 
last section concludes with a final discussion. 

2. Probability model. 

2.1. Sampling model Let n denote an 5 x T matrix of count data from 
an NGS genome sequencing experiment, with Ust denoting the number of 
reads that bear a variant sequence at the location of SNV s for tissue sample 
t, 5 = 1,..., 5 and t = 1,..., T. We assume a binomial sampling model. Let 
Nst denote the total number of reads in sample t that are mapped to the 
genomic location of SNV s. We assume 

( 1 ) nst^in{Nst,Pst)- 

In Figure 3, Ust = ^ and Ngt = 5. We do not model Ngt^ that is, we treat 
Nst as fixed, and only consider a sampling model for Ust conditional on 
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* 



Eig. 3. An illustration of the Binomial model. The illustration shows that 5 short reads 
are mapped to a position marked with * and among them three reads indieate variation at 
the position, that is, Nst = 5 and Ust = 3. 


Nst (modeling Ngt would not contribute any information on tumor hetero¬ 
geneity based on SNVs). Conditional on Nst^ the observed counts rtst are 
independent across s and t. The model in (1) is illustrated in Figure 3. 

2.2. Prior. We build a prior probability model for pst in two steps, using 
the notion that each sample is composed of a mixture of different haplotypes. 
And each haplotype, in turn, is characterized by the haplotypes consisting 
of the SNVs. Let wtc denote the proportion of haplotype c in sample t and 
let ^ {0,1} denote a latent indicator of the event that SNV s bears a 
variant sequence for haplotype c. Note that = 1 corresponds to a black 
block in Figure 2. Then pst is written as a sum over C latent haplotypes 

C C 

(2) Pst ~ “1“ ^ ^ "^tc^sc = ^tt) ^ ^ ^^tc^sc' 

C=1 C=1 

The construction of the haplotypes, including the number of haplotypes, C, 
and the indicators are latent. The key term, Ylc=i "^tc^tc^ indirectly infers 
haplotypes by explaining pst as arising from sample t being composed of a 
mix of hypothetical haplotypes which do {zsc = 1) or do not {zsc = 0) include 
a mutation for SNV s. The indicators are collected in a (*S x C) binary 
matrix Z. The number of latent haplotypes, C, is unknown. Conditional 
on C, the binary matrix Z describes C latent tumor haplotypes present 
in the observed samples. Joint inference on C, Z and explains tumor 
heterogeneity. 

In addition, we introduce a background haplotype, labeled as haplotype 
c = 0, which includes all SNVs. The background haplotype accounts for 
experimental noise and haplotypes that appear with negligible abundance. 
Specifically, eto = wtopo in (2) relates to this background haplotype, with po 
being the relative frequency of observing a mutation at an SNV due to noise 
and artifact (we assume equal frequency for all SNVs) and wto being the 
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proportion in sample t. The prior on wto is defined later. For po? we assume 
Po ^ Be(aoo, ^oo) with aoo "C 600 to inform a small po value a priori. 

We start the prior construction with a prior for the number of haplotypes, 
C. We consider a geometric distribution, C ^ Geometric(r) where E((7) = 
1 /r. Conditional on C, we use a feature allocation model for a binary matrix 
Z. We first define the model for any given C and start with feature-specific 
selection probabilities, 

(3) MclcT^-BeKC,!). 

Let /X = (/ii,..., Pc) - The selection probabilities are used to define p(Z|/x, C) 
as 

SC c 

(4) p(zi/x,c)=n - Me)=nMr(i - 

S = 1 C=1 C=1 

where rric = ^s=i is the number of SNVs bearing variant sequences for 
haplotype c. A limit of the model, as (7 ^ oc, becomes a constructive defi¬ 
nition of the Indian buffet process (IBP) [Griffiths and Ghahramani (2005), 
Teh, Goriir and Ghahramani (2007)]. The model is symmetric with respect 
to arbitrary indexing of the SNVs, simply because of the symmetry in (4) 
and (3). Note that rUc = 0 is possible with positive prior probability. 

Next, we consider a prior distribution for abundances associated with the 
haplotypes defined by Z. The haplotypes are common for all tumor samples, 
but the relative weights in the composition ( 2 ) are different across tissue 
samples. We assume Dirichlet priors for the relative weights wte^ defined as 
follows. Let 9tc denote an (unsealed) abundance level of haplotype c in tissue 

sample t. We assume Gamma(a, 1) for c=l,...,(7 and 

Gamma(ao, 1). We then define 

— ^tc j ^ ^ 

c'=0 

as the relative weight of haplotype c in sample t. This is equivalent to 

Dir(ao,a,... ,a) for t = 1,... ,r, where . .,wtc)- 

Recall the binomial sampling likelihood (1) with success probability, Pst- 
Given (7, Z and w, we define Pst in (2). In words, Pst is determined by (7, 
Z and wt with the earlier describing the latent haplotypes and the latter 
specifying the relative abundance of each haplotype in sample t. 

2.3. Posterior simulation. Let x= (Z, 0 ,po )5 where 9 = {Otc}- Markov 
chain Monte Carlo (MCMC) posterior simulation proceeds by sequentially 
drawing random numbers for the parameters in x. Given (7, such MCMC 
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simulation is straightforward. Specifically, Gibbs sampling transition proba¬ 
bilities are used to update and Metropolis-Hastings transition probabil¬ 
ities are used to update 0 and pq. It is possible to improve the mixing of the 
Markov chain by updating all columns in row s of the matrix Z jointly by 
means of a Metropolis-Hastings transition probability that proposes changes 
in the entire row vector Zg. 

The construction of transition probabilities that involve a change of C is 
more challenging, since the dimension of Z and 0 changes as C varies. We 
use a reversible jump (RJ) MCMC algorithm for posterior simulation [Green 
(1995)]. We first define a proposal distribution q{C^C) for C, and then in¬ 
troduce a proposal distribution g(x|C) for x conditional on the proposed 
C. The latter potentially involves a change in dimension of the parameter 
vector. We found that high posterior correlation of Z and w conditional on 
C greatly complicated the construction of a practicable RJ scheme. To over¬ 
come this, we use an approach similar to Casella and Moreno (2006). We split 
the data into a minimal training set (n', N') with ^st — 

and a test data set, (n",N") with = (1 ~ bst)nst etc. In the implementa¬ 
tion we use bst generated from Be(25,975). Let pi(x|C) =p(x|n',(7) denote 
the posterior distribution under C using the training sample. We use pi in 
two instances. First, we replace the original prior p(x|C) by pi(x|C) and, 
second, we also use it as proposal distribution q{Si\C) —pi{Si\C). The test 
data is then used to evaluate the acceptance probability. The strategy can 
be characterized as model comparison by fractional Bayes factors [O’Hagan 
(1995)] and is related to a similar approach proposed in Casella and Moreno 
(2006) for model comparison with intrinsic Bayes factors. Both are origi¬ 
nally proposed for model comparison with noninformative priors, but can 
be modified to facilitate MCMC across models as we need it here. 

We summarize the joint posterior distribution, p(C, Z, w,po|n), by factor¬ 
izing it as p(C|n)p(Z|n, C)p(w,po|n,C, Z). Based on the available posterior 
Monte Carlo sample, we (approximately) evaluate the marginal posterior 
distribution for C and determine the maximum a posteriori (MAP) esti¬ 
mate C^. We then estimate Z conditional on as follows: For any two 


— 


We 


matrices, Z and Z', 1 < c,c' < let Vcc'i'Zi^Z') = J2s=i I 
then define a distance (i(Z,Z') = min^^^ Pc, 7 rc(Z, Z'), where tTc is a per¬ 
mutation of {!,..., C'^} and the minimum is over all possible permutations. 
A posterior point estimate for Z is defined as 

Z^ = argrnin f (i(Z, Z^) dp{Z\n^ C'^) ^ argn^in ^ 7l\ 

^ ^ i=i 

for a posterior Monte Carlo sample, = 1 ,... ,L}. Finally, we report 

posterior point estimates w"*" and pg for w and po conditional on and 

Z-. 
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3. Simulation. We validated the proposed model in a simulation study. 
We simulated a set of S' = 100 SNVs with T = 30 samples. In the simu¬ 
lation truth, we assumed four latent haplotypes (^(^true _ ^g ^ 

background haplotype (c = 0) with all SNVs bearing variant sequences. Hap- 
lotype c = 1 has variant sequences for the first 15 SNV positions, haplotype 
2 for the first 20 SNV positions, haplotype 3 for the first 85 positions and 
haplotype 4 for the first 90 positions. In other words, SNVs 1-15 bear vari¬ 
ant sequences for all four haplotypes, SNVs 16-20 for haplotypes 2-4, SNVs 
21-85 for haplotypes 3-4, SNVs 86-90 for haplotype 4 only and SNVs 91- 
100 for none of the haplotypes, as shown in Figure 4(a). The green color in 
panel (a) implies presence (zsc = 1) of a variant sequence at SNV s for hap¬ 
lotype c and the red color shows absence (zgc = 0), for c = 1,... ,4 and s = 
1,..., 100. We then generated as follows. We let = (8,6,3,1) 

and for each t randomly permuted Let denote a random 

permutation of a™^^. We generated ^ Dir(0.2, a™^^). That is, 

the first parameter of the Dirichlet prior for the ((^true 1)-dimensional 
weight vector was 0.2, and the remaining parameters were a permutation of 
aTRUE Using the assumed and w^R^^ and letting pgR^^ = 0.01 

and Nst = 50 for all t and 5, we generated rist with 

^true _ ^TRUE^TRUE Ylc=i weights w^RRR are shown 

in Figure 4(b). Similar to the heatmap of Z^R^^, the green color in panel (b) 
represents high abundance of a haplotype in a sample and the red color low 
abundance for c = 0,..., 4 and t = 1,..., 30. For haplotype 0 the heatmap 
plots wtoPQ. The samples in rows are rearranged for better display. 


SNV 


samples 



Eig. 4. Heatmaps of and in the simulation. 
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To fit the proposed model, we took r = 0.2, a = 3, ao = 0.5, a = 0.5, aoo = 
1 and boo = 100. For each value of C, we initialized Z using the observed 
sample proportions. We generated initial values for 9tc and po by prior draws. 

We generated bgt ^ ^ Be(25,975) to construct the minimal training set and 
ran the MCMC simulation over 25,000 iterations, discarding the first 10,000 
iterations as initial burn-in. 

Figure 5(a) reports the posterior distribution of C in which the dashed 
vertical line represents the true value C^true _ ^ posterior mode = 
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Eig. 5. Posterior inference for the simulated data. 
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4 recovers the truth. We then find the posterior point estimates of Z, w 
and po conditional on as described in Section 2.3. We compared 
with the posterior estimates Pst = Po'^^o + Ylc=i'^tc^sc' Figure 5(b) shows 
the histogram of the errors {pst — Fitting appears to be great as 

{Pst — Pst) centers at 0. Figure 5(c) and (d) show heat maps for and 
(given C'^ = 4). The estimate Z^ in Figure 5(c) places SNV 86-90 into 
haplotypes 3 and 4. The latter are two identical haplotypes. This may be 
because for haplotype c = 3 is small for almost all the samples. The 

weights for the two dominant subclones, for c= 1,2, are close to the 
simulation truth, and for c = 3,4 are closer to the average of 
c = 3,4. 

For comparison, we implemented PyClone [Roth et al. (2014)] with the 
same simulated data. We used the infinite beta-binomial mixture model in 
PyClone assuming that the copy number at mutation positions is known. 
Figure 6(a) shows the estimated variant allelic prevalence for each mutation 
for each sample under PyClone. Columns are samples and rows are SNVs. 
The white horizontal lines separate the estimated SNV clusters. PyClone 
identified four clusters of SNVs: cluster 1 with SNV 1-20, cluster 2 with 
SNV 21-85, cluster 3 with SNV 86-90 in cluster 3, and cluster 4 with SNV 
91-100. 

The estimated cluster 1 includes the SNVs that under the simulation 
truth appear in all the true haplotypes or in true haplotypes 2-3; cluster 2 
includes the SNVs from true haplotypes 3-4; cluster 3 includes SNVs from 


SNV 



samples 

Color Key 



0.2 0.4 0.6 0.8 
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(a) Heatmap of cellular prevalence 



(b) Mean prevalence for each cluster 


Fig. 6. Estimated cellular prevalence of four SNV clusters over samples by PyClone for 
the simulated data. 
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true haplotype 4; and cluster 4 includes SNVs that appear in none of the true 
haplotypes. Figure 6(b) shows estimated mean cellular prevalence of each 
cluster across the 30 samples. In summary, the reconstruction under PyClone 
is reasonable, but stops short of recovering the true subclones, which cannot 
possibly be represented as the assumed nonoverlapping clusters. 

Finally, we carried out another simulation to the sensitivity of the pro¬ 
posed inference to different assumptions on experimental noise. In particular, 
we considered experimental noise that varies across SNVs, as it could arise 
from potential bias or errors in data preprocessing, including sequencing 
bias, mapping bias, errors in variant calling etc. Details of the simulation 
study are reported in the supplementary material [Lee et al. (2015)]. Brieffy 
summarized, in the simulation truth we replaced the error term £to in (2) 
by an SNV-specific term £ts- But we continued to fit the model with the 
common £to^ ns in (2). We still find reasonable posterior inference. 

For details, refer to the supplemental material. 

4. Pancreatic cancer data. We analyzed NGS data obtained from exome 
sequencing of five samples of pancreatic ductal adenocarcinoma (PDAC) 
patients at NorthShore hospital. PDAC is a particularly aggressive tumor 
with median survival of less than a year. We extracted genomic DNA from 
each tissue and constructed an exome library from these DNA using Agilent 
SureSelect capture probes. The exome library was then sequenced in paired- 
end fashion on an Illumina HiSeq 2000 platform. About 60 million reads— 
each 100 bases long—were obtained. Since the SureSelect exome was about 
50 Mega bases, raw (pre-mapping) coverage was about 120-fold. We then 
mapped the reads to the human genome (version HG19) [Church et al. 
(2011)] using BWA [Li and Durbin (2009)] and called variants using GATK 
[McKenna et al. (2010)]. Post-mapping, the mean coverage of the samples 
was between 60 and 70 fold. 

A total of nearly 115,000 SNVs and small indels were called within the 
exome coordinates. We restricted our attention to SNVs (i) that occur within 
genes that are annotated to be related to PDAC in the KEGG pathways 
database [Kanehisa et al. (2010)], (ii) that make a difference to the protein 
translated from the gene, and (hi) that exhibit significant coverage in all 
samples. This filtering left us with S — 118 SNVs. 

In summary, using the earlier introduced notation, the data record the 
read counts (Ngt) and mutant allele read counts (rist) of 5 = 118 SNVs from 
T = 5 tumor samples. Figure 7 shows a summary of the data. The large 
Nst values make the binomial likelihood very informative. For the prior 
specification, we let r = 0.2, a — 1^ a = l, ao = l, aoo = 5 and boo — 95. We 
generated bgt from Be(25,975) for the minimal training set. We ran a MCMC 
posterior simulation over 35,000 iterations, discarding an initial transient of 


BAYESIAN MODEL EOR TUMOR HETEROGENEITY 


13 




Eig. 7. Pancreatic cancer data: The left panel shows a histogram of the total number 
of mapped reads, Nst, and the right panel shows a histogram of the empirical fractions, 

'^st /-Y st • 

10,000 iterations. Figure 8(a) shows the marginal posterior distribution for 
C. The posterior mode is = 4. 

The posterior point estimate of Z conditional on is shown in Fig¬ 
ure 8(b) and the corresponding posterior point estimate of w in Figure 8(c). 
Here, green represents a variant sequence and red represents a reference se¬ 
quence. We find that each sample has two or three two dominant haplotypes, 
that is, two green columns for each row in the heat map. Haplotypes 2, 3, 4 
are shared by different sets of the five samples. For example, sample 2 has a 
large-scaled abundance level for haplotypes 1, 2 and 3. Sample 4 is mainly 
dominated by haplotypes 1 and 3. 

These results indicate that while tumors are unique, there are haplotypes 
that do recur across different patients. The results also clearly show that 
each tumor (in this data) is made of more than one haplotype: usually two 
or three dominant haplotypes and other minor types. To our knowledge, this 
is the first attempt to analyze the internal clonal composition of multiple 
PD AC tumor samples based on NGS data. 

For comparison, we also evaluated tumor heterogeneity for the same pan¬ 
creatic cancer data using PyClone [Roth et al. (2014)]. The results are shown 
in Figure 9. The posterior estimated clustering includes 24 SNV clusters, 
shown in panel (a). The estimated mean cellular prevalences of each clus¬ 
ter across the five samples are shown in (b). The estimated mean cellular 
prevalences vary substantially across samples. 

5. Lung cancer data. We record whole-exome sequencing for four surgi¬ 
cally dissected tumor samples taken from the same patient with lung cancer. 
The same bioinformatics preprocessing and analysis were carried out as in 
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Fig. 8. Pancreatic cancer data: The posterior distribution of C in (a), the heatmaps of 
Z* and with C* = 4 m (b) and (c), respectively. Note that for c = 0, PoZ^q is illustrated 
in the first column of panel (c). 




the previous pancreatic cancer example. We obtained SNVs and filtered 
them based on criteria similar to the previous example, leaving us in the 
end with S = 101 SNVs for the four intra-tumor samples. 

We estimated the proposed Bayesian feature allocation model with the 
same hyperparameters as in the previous PDAC data analysis. Figure 10 
summarizes the inference results. Panel (a) shows the marginal posterior 
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Fig. 9. Pancreatic cancer data: Estimated cellular prevalence of SNV clusters over sam¬ 
ples by PyClone. 


distribution for (7, identifying a posterior mode at — 3, that is, three 
distinct haplotypes. Panel (b) shows the posterior point estimate, Z^, con¬ 
ditional on = 4. The figure shows which SNVs are included for each of 
the three haplotypes. Haplotype 3 contains the smallest number of muta¬ 
tions (green bars), implying that haplotype 3 might be the parental tumor 
cells. Haplotypes 1 and 2 are descendants of haplotype 3 with additional 
somatic mutations. Phylogenetically, a simple lineage can be hypothesized, 
with haplotype 3 as the parent of haplotype 1 and/or haplotype 2. Hap¬ 
lotype 2 possesses a large number of new somatic mutations, potentially 
representing a type of aggressive tumor cell. Panel (c) presents the poste¬ 
rior point estimate of w, w"*" with and Z'^. Examining haplotypes 1-3, 
we found that, interestingly, all four tumor samples share similar values of 
w/, implying a lack of spatial heterogeneity across the tumor samples. In 
other words, these samples all possess the inferred three tumor haplotypes 
in panel (b) with a similar composition. 

Again, for comparison we also used PyClone [Roth et al. (2014)] with 
the lung cancer data. The results are shown in Figure 11. The estimated 
clustering identified six clusters of mutations. The mean prevalences within 
a mutation cluster are similar across samples. 

6. Conclusions. Tumors are heterogeneous tissues. The traditional way 
to identify this heterogeneity has been to sequence multiple samples from 
the tumor. Using such data to study the coexistence of genetically differ- 
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Fig. 10. Lung cancer data: The posterior distribution of C in (a), the heatmaps of 7T 
and w* with C* = 3 in (b) and (c), respectively. Note that for c = 0, PoZ^q is illustrated 
in the first column of panel (c). 


ent subpopulations across tumors and within a tumor can shed light on 
cancer development. Identifying subpopulations within a tumor can lead 
to clinically important insights. For example, Landau et al. (2013) found 
that a chemotherapy affects subclonal heterogeneity in chronic lymphocytic 
leukemia. They also observed that the presence of a certain subpopulation 
may adversely affect clinical outcome. 














BAYESIAN MODEL EOR TUMOR HETEROGENEITY 


17 


SNV 



■I- C\J CO 

OD CO , CO CO 

samples 

Color Key 

c ^ _ 



0.2 0.4 0.6 0.8 

Value 

(a) Heatmap of cellular prevalence 


00 

o 


CO 

CO 


CM 

d 


o 

d 



H-1 -1-1-1-1-1—^ 

1.0 1.5 2.0 2.5 3.0 3.5 4.0 


Sample 

(b) Mean prevalence for each cluster 


Fig. 11. Lung cancer data: Estimated cellular prevalence of SNV clusters over samples 
by PyClone. 


We have proposed a model-based approach based on a feature allocation 
model. The feature allocation model allows us to impute inference about 
different components of tumor tissues based on NGS data. The identified 
components are not necessarily unique because there might be other possible 
solutions which can lead to the same hypothetical mutation frequencies. 
Instead of reporting a single solution, the proposed approach provides a 
full probabilistic description of all possible solutions as a coherent posterior 
probability model over C, Z and w. 

A number of extensions are possible for the present model. First, the 
number of SNVs examined in this paper was relatively limited (about 100), 
although the total number of SNVs that were found in the whole exome of a 
tissue is on the order of about 50,000. Other than computational complexity, 
there is no bar in principle on expanding the model to analyze the whole 
SNV complement of the exome. It could also be instructive to quantify the 
cellular diversity of the tumor based on findings from various regions of the 
exome. 

Another important extension of the model is in the basic representation of 
subclones and haplotypes. The current model uses a binary matrix to record 
whether a variant sequence for an SNV is present or absent in a haplotype. 
A variation of the model could instead record for each subclone whether an 
SNV is absent [zgc — 0), heterozygous [zgc = 1) or homozygous [zgc — 2). 
That is, Z would become a trinary matrix. Other extensions of the model 
are to consider each SNV position to have four possible bases, A,(7, G,T, 
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to introduce dependence among mutations or to formally model the noise 
in variant calling. Each of these extensions is currently in development. For 
example, incorporating explicit error probabilities in variant calls is possible. 
Similar to our previous work [Ji et al. (2011)], we could replace the binomial 
likelihood (1) in the proposed model with a Bernoulli likelihood, for each 
read, where the probability associated with a read depends on quality scores 
of base calling and read mapping. We will consider this extension as future 
work. 

Tumor genome sequencing projects have typically looked for specific genes 
to be mutated or not. The inherent assumption here, so far unproven, is 
that the overall effect of carcinogenesis could be explained by a handful of 
changes in a small number of genes. Our model takes the opposite approach 
and allows us to examine the whole genome (or exome) and, by considering 
VAF patterns, to construct reasonable models for the tissue. We believe this 
holistic approach to the analysis might provide more robust conclusions and 
biomarkers than the gene-by-gene approach. 

SUPPLEMENTARY MATERIAL 

Supplement to ‘‘A Bayesian feature allocation model for tumor hetero¬ 
geneity” (DOI: 10.1214/15-AOAS817SUPP; .pdf). The supplementary ma¬ 
terial includes the second simulation study. 
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